Abstract

Recurrent neural networks (RNNs) in combination with a pooling operator and
the neighbourhood components analysis (NCA) objective function are able to
detect the characterizing dynamics of sequences and embed them into a
fixed-length vector space of arbitrary dimensionality. The resulting features
are meaningful and can be used for visualization or nearest neighbour
classification in linear time. This kind of metric learning for sequential
data enables the use of algorithms tailored towards fixed-length vector spaces
such as $\mathbb{R}^n$.

1. Introduction

Sequential data is found in many domains including medical applications, robot
control, neuroscience, financial information or text processing. This data is
fundamentally different from static data vectors.

When considering a single sequence over $T$ time steps
$x = (x_1, x_2, \dots, x_T) \in X^*$ with $X \subset \mathbb{R}^n$, the order of
the individual elements $x_i$ is relevant for the interpretation. Conversely,
in the case of static data $x' \in \mathbb{R}^n$, an ordering on the $n$
components is not even defined. Indeed, the key element of structured data is
that the context (i.e., dependency between the time steps) contains essential
information to make learning on the data possible.

An often used simplifying approach is to treat the data as if it were static. As 
we will demonstrate, this assumption makes the detection of behaviour typical 
for sequences impossible. A more promising and principled way is to extract 
higher-order features from the sequences. For instance, given a sequence of robot 
joint positions $q(t)$, standard methods can be used to calculate derivatives $q'(t)$
and $q''(t)$, which can then be used as additional data. Alternatively, domain
knowledge of experts can be used. Nonetheless, such methods usually require
prior knowledge of the underlying data-generating process, which is not
always available or efficient to obtain. Turning to algorithmically learned
features is a logical step.

The authors of (Goldberger et al. 2004) note that metrics and features are
actually closely related: by measuring pairwise distances between the data points
$x' \in \mathbb{R}^n$, the data can be embedded into a metric space. They learn a
Mahalanobis distance by mapping the high-dimensional data set $X$ to a metric space
$Z$ in which $k$-nearest neighbour classification performance is maximized. The
resulting objective function is differentiable with respect to the embedding.

Similar to (Salakhutdinov and Hinton 2007), we use a different model for learn- 
ing the embedding function. Our choice, recurrent neural networks (RNNs), are 
rich models for sequence learning. They have been successfully used for hand- 
writing recognition (Graves and Schmidhuber 2009), audio processing (Graves 
and Schmidhuber 2005), and text modelling (Martens, Sutskever, and Hin- 
ton 2011). Although in principle capable of approximating any measurable 
sequence-to-sequence mapping (Hammer 2000), they are notoriously hard to 
train. We successfully apply them to several data sets by making use of the 
most recent version of a special architecture, the Long Short-Term Memory 
(Hochreiter and Schmidhuber 1997), which overcomes the learning difficulties. 

2. Characteristics of Sequential Data 

We consider a sequence $x_t \in X \subset \mathbb{R}^n$, with $t = 1, 2, \dots, T$, where $X$ is called
the sequence space. The sequence space can be a representation of time series, 
as well as nominal data such as text (using "1-of-k" encodings). 

2.1. Requirements for Sequence Metrics 

Ideally, a metric would reflect the different axes of the underlying process and 
only those. To illustrate the difficulty of this, we give a small number of examples 
of what nature these dynamics might be and what a learning machine has to be 
able to detect. The concepts are visualized in figure 1:

1. Time lags are essentially translations along the time axis. These
translations might occur in the middle of a sequence, not only at the beginning
or at the end. Distances that are a sum of pairwise distances of the form
$D(x, y) = \sum_i d(x_i, y_i)$, such as the Euclidean (with
$d(x, y) = \sqrt{(x - y)^2}$) or the Hamming distance (with
$d(x, y) = 1 - \mathbb{I}(x = y)$), are unable to capture this. Sequences might
as well be translated in sequence space as a whole. The essential part of a
sequence might be that it increases over the whole time span and thus depends
on the difference of two successive values and not on their respective
differences from some arbitrary origin.

Figure 1: Examples of sequence characteristics. Top row, from left to right:
basic sequence, translation along the time axis, scaling along the time axis. 
Bottom row: translation along the time axis, scaling along the time axis and 
additional Gaussian noise. 

2. The axes might have different scalings. While the actual magnitudes of 
the values at the individual time steps are less significant, the topology 
induced by those values can be crucial. E.g., it might be necessary to
detect how many spikes are in a sequence whilst the individual heights of
those spikes are of less interest for the task. More extreme cases include a
drift over time, in which the axes might even be warped nonlinearly.

3. Lots of time series data are subject to stochasticity. While each time step 
might be distorted by a Gaussian noise term, noise can also occur along 
the time axis. E.g., the time span until a special event happens might be 
random itself. 
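
The first point can be illustrated numerically. The following is a minimal sketch, assuming NumPy and toy sequences invented for illustration (the helper name `pointwise_distance` is hypothetical). It sums per-step absolute differences, one simple distance of the pairwise form discussed in item 1.

```python
import numpy as np

def pointwise_distance(x, y):
    """Sum of per-time-step distances d(x_i, y_i) = sqrt((x_i - y_i)^2)."""
    return float(np.sum(np.sqrt((x - y) ** 2)))

T = 20
pulse = np.zeros(T)
pulse[3:6] = 1.0               # a short pulse early in the sequence
shifted = np.roll(pulse, 10)   # the same pulse, translated along the time axis
flat = np.zeros(T)             # a sequence without any pulse

d_shifted = pointwise_distance(pulse, shifted)
d_flat = pointwise_distance(pulse, flat)
print(d_shifted, d_flat)       # 6.0 3.0
```

In this toy case the time-shifted copy of the pulse ends up farther from the original than a sequence containing no pulse at all, although it shares the same dynamics.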

A more general way of putting this central requirement is that the metric should
be able to capture the underlying dynamics. In many phenomena, the nature of
these dynamics is more complicated than the points mentioned above: each of
them can be either negligible or of central importance for the given task.

2.2. Related Work 

Only a few principled approaches exist for extracting fixed length features from 
sequential data. A commonly used practice is based on a set of fixed basis 
functions (e.g. Fourier or wavelet basis). 

While it has strong mathematical guarantees, it is sometimes too inflexible: in 
order to work with arbitrarily long sequences, a sliding time window has to be 
employed, limiting the capability to model context. 

Furthermore, the fixed set of basis functions implies that the problem of
identifying useful factors of variation remains unresolved in general.

A probabilistic yet supervised approach is to use a generative model. For each
class $c_i$, a conditional $p(x \mid c_i)$ is estimated. The resulting models can be
chosen to be capable of capturing the characteristics of the data at hand (e.g.
hidden Markov models for speech data) and are combined with class distributions
$p(c_i)$ into a Bayesian classifier. The predictive distribution is then given by
$p(c_i \mid x) = \frac{p(x \mid c_i)\, p(c_i)}{\sum_j p(x \mid c_j)\, p(c_j)}$.
The resulting posterior probabilities can then be interpreted as features. Being
essentially class memberships, they can be deemed not expressive enough as they
are very abstract.
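
As an illustration of how such posterior features arise, here is a minimal sketch assuming NumPy; the log-likelihood values, the priors and the function name `class_posteriors` are made up for the example and do not come from any model in this paper. It applies Bayes' rule to per-class log-likelihoods of one sequence.

```python
import numpy as np

def class_posteriors(log_likelihoods, priors):
    """Return p(c_i | x) from log p(x | c_i) and p(c_i) via Bayes' rule."""
    log_joint = np.asarray(log_likelihoods, dtype=float) + np.log(priors)
    log_joint = log_joint - np.max(log_joint)   # stabilise the exponentials
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Hypothetical log-likelihoods of one sequence under three class models.
log_lik = [-120.0, -118.0, -125.0]
priors = np.array([0.5, 0.3, 0.2])
features = class_posteriors(log_lik, priors)
print(features)
```

The resulting feature vector has one entry per class and sums to one, which is why it carries little more than class membership information.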

Fisher kernels (Jaakkola and Haussler 1998), a combination of probabilistic 
generative models with kernel methods, provide another commonly used vectorial
representation of sequences. The basic idea is that two similar objects induce 
similar gradients of the likelihood for the parameters of the model. 

Thus, the features for a sequence are the elements of the gradient of the
log-likelihood of this sequence with respect to the model parameters.

This choice can turn out to be very bad: if the distribution represented by
the trained model closely resembles the data distribution, the gradients for all
sequences in the data set will be nearly zero. A recent paper (Maaten 2011)
alleviates this problem by exploiting label information and employing ideas from
metric learning. Obviously, this only works if class information is available.

A fully unsupervised approach is to use the parameters estimated by a system 
identification method (e.g., a linear dynamical system) as features. Recent work 
includes (Li and Prakash 2011), in which a system based on complex numbers
successfully clusters motion capture data.

The last two approaches clearly suffer from the fact that the number of features 
is directly connected with the complexity of the model. In particular it is not 
given that the important factors of variation are captured by these methods. 

3. Recurrent Neural Networks 

Recurrent neural networks are an extension of feedforward networks which can
represent sequential data by having an internal state. While a state-free
feedforward network immediately "forgets" the data it has seen, RNNs have
weighted connections through time, which means that information can be kept
beyond the forward propagation of a single input vector. The inputs to an RNN
are given as a sequence $(x_1, x_2, \dots, x_T)$. Subsequently, a sequence of
hidden states $(h_1, h_2, \dots, h_T)$ and a sequence of outputs
$(o_1, o_2, \dots, o_T)$ is calculated via the following equations:



$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$    (1)
$o_t = W_{ho} h_t + b_o$    (2)

Figure 2: Left: Structure of an ordinary RNN. Dashed lines represent recurrent
connections. Right: LSTM-RNN with a single cell. The red dots represent the
states $s_t$, while the arrows represent summing connections. The dotted
connections represent multiplicative interactions. Peepholes left out for clarity.



where $t = 1, 2, \dots, T$ and $\sigma$ is a suitable transfer function, typically the
hyperbolic tangent, applied element-wise. $W_{xh}$, $W_{hh}$ and $W_{ho}$ are weight
matrices while $b_h$ and $b_o$ are bias terms. For the calculation of $h_1$ a special
initial hidden state $h_0$ has to be used, which can be optimized during learning
as well.

The dimensionality of the adaptable parameters $W_{xh}$, $W_{hh}$, $W_{ho}$, $b_h$, $b_o$ and
$h_0$ is determined by the given input and output dimensions and the size of the
hidden layer. Given an input size $I$, a hidden size $H$ and an output size $O$, the
parameters have the following dimensionalities: $W_{xh} \in \mathbb{R}^{I \times H}$,
$W_{hh} \in \mathbb{R}^{H \times H}$, $W_{ho} \in \mathbb{R}^{H \times O}$,
$b_h, h_0 \in \mathbb{R}^H$ and $b_o \in \mathbb{R}^O$.
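
The forward pass defined by equations (1) and (2) can be sketched as follows. This is a minimal illustration assuming NumPy, the hyperbolic tangent as transfer function and randomly drawn weights; the function name `rnn_forward` is hypothetical. Inputs are treated as row vectors so that the weight shapes match the dimensionalities stated above.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_ho, b_h, b_o, h0):
    """Apply equations (1) and (2) to a sequence xs of shape (T, I)."""
    h = h0
    hidden, outputs = [], []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)   # equation (1), tanh as transfer function
        hidden.append(h)
        outputs.append(h @ W_ho + b_o)             # equation (2)
    return np.array(hidden), np.array(outputs)

rng = np.random.default_rng(0)
T, I, H, O = 7, 3, 5, 2
hidden, outputs = rnn_forward(
    rng.normal(size=(T, I)),
    W_xh=rng.normal(scale=0.1, size=(I, H)),
    W_hh=rng.normal(scale=0.1, size=(H, H)),
    W_ho=rng.normal(scale=0.1, size=(H, O)),
    b_h=np.zeros(H),
    b_o=np.zeros(O),
    h0=np.zeros(H),
)
print(hidden.shape, outputs.shape)  # (7, 5) (7, 2)
```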

The structure of RNNs is illustrated in figure 2.

RNNs have a lot of expressive power since their states are distributed and
nonlinear dynamics can be modelled. The calculation of their gradients is
straightforward via Backpropagation Through Time (BPTT) (Mozer 1989) or Real-Time
Recurrent Learning (Williams and Zipser 1995). The guiding mathematical tool
is the chain rule, which can be applied "through time" as well. However,
first-order gradient methods fail to capture relations between events that are
more than about ten time steps apart.
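
The decay of the back-propagated signal can be observed numerically. The following is a minimal sketch, assuming NumPy, a tanh RNN with small random recurrent weights and random stand-ins for the input term (all values are illustrative assumptions, not settings from the experiments). It accumulates the Jacobian of the hidden state with respect to the initial state and records its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 10, 50
W_hh = rng.normal(scale=0.1, size=(H, H))   # small recurrent weights (assumption)

h = np.zeros(H)
jac = np.eye(H)        # accumulates d h_t / d h_0 via the chain rule
norms = []
for t in range(T):
    input_term = rng.normal(size=H)         # stands in for the input and bias contribution
    h = np.tanh(input_term + h @ W_hh)
    # Row-vector convention: entry [j, k] is d h_t[k] / d h_{t-1}[j].
    step = W_hh * (1.0 - h ** 2)
    jac = jac @ step
    norms.append(np.linalg.norm(jac))

print(norms[0], norms[9], norms[-1])
```

With weights of this scale the norm shrinks roughly geometrically with the number of time steps, so the influence of early states on late ones becomes negligible for a gradient-based learner.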

This problem is called the vanishing gradient and has been studied by (Hochreiter
1991) and (Bengio, Simard, and Frasconi 1994). For more than ten years, the state
of the art method to overcome it has been the Long Short-Term Memory (LSTM)
(Hochreiter and Schmidhuber 1997). Recently, (Martens and Sutskever 2011)
introduced a second-order optimization method for RNNs, the Hessian-free
optimizer (HF-RNN), which is able to cope with the aforementioned long-term
dependencies as well, outperforming LSTM on several benchmarks. In this work, we
stick to LSTM since the HF-RNN is tailored towards convex loss functions, whereas
neighbourhood components analysis (NCA), the objective function of choice in this
paper, is not convex.

Another neural model for nonlinear dynamical systems is the echo state network 
approach introduced in (Jaeger and Haas 2004). The drawback of this method 
is that the dynamics that are to be modelled have to be already present in the 
network's random initialization. 

3.1 Recurrent Networks are Differentiable Sequence Approximators

One consequence of the differentiability of RNNs is that we can optimize their
parameters with respect to an objective function.¹ Stochastic gradient descent
or higher-order methods are the techniques of choice to fit the weights.

A long overlooked but obvious potential is to reduce output sequences to a
single vector with a pooling operation. A pooling operation is a function
$p : X^* \to X$ that reduces an arbitrary number of inputs to a single output of the
same set, e.g. taking the sum or picking the maximum. Similar to convolutional
neural networks, we can use this technique to reduce a sequence to a point.
If our pooling operation is differentiable as well, we can use it as a gateway to
arbitrary objective functions that are defined on real vectors. Given a network $f$
parametrized by $W$, a data set $\mathcal{D} = \{x_i\}$, a pooling operation $p$ and an
objective function $\mathcal{O}$, we proceed as follows:

1. Process input sequences $x_i = (x_{i1}, \dots, x_{iT})$, $x_{it} \in \mathbb{R}^n$, to produce output
sequences $f(x_i; W) = o_i = (o_{i1}, \dots, o_{iT})$, $o_{it} \in \mathbb{R}^m$,

2. Use a pooling operation $p$ to reduce the output sequences to points via
$p(o_{i1}, \dots, o_{iT}) = e_i$,

3. Calculate the objective function $\mathcal{O}(\{e_i\})$.
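
Step 2 of this procedure can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `pool` and the toy output sequences are hypothetical. It shows that sequences of different lengths are reduced to points of the same dimensionality.

```python
import numpy as np

def pool(outputs, kind="mean"):
    """Reduce an output sequence of shape (T, m) to a single point of shape (m,)."""
    if kind == "mean":
        return outputs.mean(axis=0)
    if kind == "sum":
        return outputs.sum(axis=0)
    if kind == "max":
        return outputs.max(axis=0)
    raise ValueError(f"unknown pooling operation: {kind}")

# Output sequences of different lengths map to points in the same space.
short = np.arange(6.0).reshape(3, 2)     # T = 3, m = 2
long = np.arange(20.0).reshape(10, 2)    # T = 10, m = 2
e_short, e_long = pool(short), pool(long)
print(e_short, e_long.shape)             # [2. 3.] (2,)
```

The mean and the sum are differentiable everywhere; the maximum is differentiable almost everywhere, passing the gradient only to the selected time step.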

Since the whole calculation is differentiable, we can evaluate the derivative of 
the objective function with respect to the parameters of the RNN via 

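$\frac{\partial \mathcal{O}}{\partial W} = \frac{\partial \mathcal{O}}{\partial p} \frac{\partial p}{\partial f} \frac{\partial f}{\partial W}$.    (3)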

Subsequently, we can use the gradients to find embeddings $\{e_i\}$ of our data
which optimize the objective function. We apply this insight to combine RNNs
with neighbourhood components analysis (NCA), which we will introduce in
section 4.

¹ The authors recommend using automatic or symbolic differentiation. In this work,
Theano (Bergstra et al. 2010) was used.



3.2 Long Short-Term Memory 

The capability of LSTM cells to relate events in sequences that are hundreds of
time steps apart is attributed to a special building block. These so-called gating
units implement a differentiable version of the if ... then ... construct
found in programming languages. We define $\phi(c, v) = v\,\sigma(c)$ with $\sigma$ being the
sigmoid function ranging from 0 to 1. Here, $c$ can be seen as the condition
that controls $v$: if $c$ is very low (representing false) the output is 0. If it is very
high (representing true) the output is $v$.
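
The behaviour of the gating function can be checked directly. The following is a minimal sketch assuming NumPy; the names `sigmoid` and `gate` and the test values are chosen for illustration only.

```python
import numpy as np

def sigmoid(c):
    return 1.0 / (1.0 + np.exp(-c))

def gate(c, v):
    """Differentiable 'if c then v else 0': v multiplied by sigmoid(c)."""
    return v * sigmoid(c)

v = np.array([2.0, -3.0])
closed = gate(-20.0, v)   # condition strongly false: output close to 0
opened = gate(20.0, v)    # condition strongly true: output close to v
half = gate(0.0, v)       # undecided condition: half of v passes
print(closed, opened, half)
```

Because the sigmoid is smooth, the gate can be trained by gradient descent, unlike a hard conditional.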

A central concept is the sequence of states $(s_1, s_2, \dots, s_T)$ of the cell. These
can be altered by the inputs via the input, forget and output gates. We will now
give the formulas for a recurrent neural network with LSTM cells. To keep the
notation uncluttered, we concatenate the four different inputs $a_t^{(\cdot)}$ to the
cell into a single vector $a_t$. As indicated by the superscript, each of
$a_t^{(i)}$, $a_t^{(f)}$ and $a_t^{(o)}$ represents an input to one of the gates $i$, $f$
and $o$. The superscript $x$ represents the input to the cell itself.



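$[a_t^{(x)}\; a_t^{(i)}\; a_t^{(f)}\; a_t^{(o)}] = W_{ha} h_{t-1} + W_{xa} x_t + b_a$    (4)
$s_t = \underbrace{\phi(a_t^{(i)}, a_t^{(x)})}_{\text{input gate}} + \underbrace{\phi(a_t^{(f)}, s_{t-1})}_{\text{forget gate}}$    (5)
$h_t = \underbrace{\sigma(\phi(a_t^{(o)}, s_t))}_{\text{output gate}}$    (6)
$o_t = W_{ho} h_t + b_o$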

Given an input size $I$, a hidden size $H$ and an output size $O$, the parameters
have the following dimensionalities: $W_{xa} \in \mathbb{R}^{I \times 4H}$,
$a_t^{(\cdot)} \in \mathbb{R}^{H}$, $s_t \in \mathbb{R}^{H}$, $W_{ha} \in \mathbb{R}^{H \times 4H}$,
$b_a \in \mathbb{R}^{4H}$ and $b_o \in \mathbb{R}^{O}$. Recurrency is achieved twofold: first, in
equation (4) via the weight matrix $W_{ha}$, and second, in the forget gate in the
second term of equation (5). The latter connection is not parametrized and thus
sometimes referred to as a constant error carousel.
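
One time step of an LSTM layer following equations (4) to (6) can be sketched as below. This is a minimal illustration assuming NumPy, row-vector inputs, randomly drawn weights and the sigmoid for the outer function in equation (6); the ordering of the four blocks inside the concatenated vector and the function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(c):
    return 1.0 / (1.0 + np.exp(-c))

def phi(c, v):
    """Gating function: v multiplied by sigmoid(c)."""
    return v * sigmoid(c)

def lstm_step(x_t, h_prev, s_prev, W_xa, W_ha, b_a):
    """One time step of equations (4) to (6), without peepholes."""
    a = h_prev @ W_ha + x_t @ W_xa + b_a       # equation (4), shape (4H,)
    a_x, a_i, a_f, a_o = np.split(a, 4)        # cell input, then the three gate inputs
    s_t = phi(a_i, a_x) + phi(a_f, s_prev)     # equation (5)
    h_t = sigmoid(phi(a_o, s_t))               # equation (6); sigma assumed to be the sigmoid
    return h_t, s_t

rng = np.random.default_rng(0)
I, H = 3, 4
W_xa = rng.normal(scale=0.1, size=(I, 4 * H))
W_ha = rng.normal(scale=0.1, size=(H, 4 * H))
b_a = np.zeros(4 * H)

h, s = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, I)):
    h, s = lstm_step(x_t, h, s, W_xa, W_ha, b_a)
print(h.shape, s.shape)  # (4,) (4,)
```

When the forget gate is fully open and the input gate fully closed, the state is copied unchanged from one step to the next, which is the constant error carousel mentioned above.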

A major improvement of the LSTM cells was the introduction of peepholes by
(Gers, Schraudolph, and Schmidhuber 2003). Additional connections from the
states to the gates make it possible to learn precise timings. This results in
additional learnable parameters $p_i, p_f, p_o \in \mathbb{R}^H$. Letting $\odot$ represent
the element-wise multiplication of vectors, we change equations (5) and (6) in
the following way:

$s_t = \phi(a_t^{(i)} + p_i \odot s_{t-1}, a_t^{(x)}) + \phi(a_t^{(f)} + p_f \odot s_{t-1}, s_{t-1})$    (7)
$h_t = \sigma(\phi(a_t^{(o)} + p_o \odot s_t, s_t))$.    (8)

A network with LSTM cells is shown in figure 2.



4. Sequential Neighbourhood Components Analysis

The central assumption of neighbourhood components analysis (Goldberger et 
al. 2004; Salakhutdinov and Hinton 2007) is that items of the same class lie 
near each other on a lower-dimensional manifold. To exploit this, we want to
learn a function $f : X \to Z$ from the sequence space $X$ to a metric space $Z$ that
reflects this.

The resulting embeddings are tailored towards good performance in combination
with the $k$-nearest neighbour algorithm. The embeddings are however not
limited to this. Other approaches, such as DrLim (Hadsell, Chopra, and Lecun
2006), use the same idea in combination with energy-based models to learn
metrics. The resulting embeddings can be used in conjunction with any
algorithm working on static data. Practical results have shown that large
margins separate the distinct classes. In some cases, points of the same class
form multiple clusters.

In our case, the embedding function is given as $e(x; W) = (p \circ f)(x; W)$. A recurrent
neural network $f$ is used to map sequences over $\mathbb{R}^I$ to sequences over $\mathbb{R}^O$ of
equal length. The resulting output sequence is then reduced to a single point
via the pooling operation $p$.

Given a set of sequences with associated class labels $\mathcal{D} = \{(x_i, c_i)\}$ mapped to
a set of embeddings $\mathcal{E} = \{e(x_i; W) = e_i\} \subset Z$, we define the probability that a
point $a$ selects another point $b$ as its neighbour based on pairwise distances $d_{ab}$
as

$p_{ab} = \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$,    (9)

while the probability that a point selects itself as a neighbour is set to zero:
$p_{aa} = 0$. The pairwise distances are determined by the Euclidean distances of
the respective embeddings: $d_{ab} = \|e_a - e_b\|_2$.

The probability that a point i belongs to a certain class k depends on the classes 
of the points in its neighbourhood 

$p(c_i = k) = \sum_j p_{ij}\, \mathbb{I}(c_j = k)$,



where $\mathbb{I}$ is the indicator function that returns 1 if the argument is true and
0 otherwise. We are now set to state the overall objective function: we maximize
the expected number of correctly classified points

$\mathcal{O} = \sum_i \sum_j p_{ij}\, \mathbb{I}(c_i = c_j)$.

This objective can be optimized as shown in section 3.1 with equation (3).
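
The objective can be evaluated for a given set of embeddings as sketched below. This is a minimal illustration assuming NumPy and hand-picked toy embeddings; the function name `nca_objective` is hypothetical. It computes the neighbour probabilities as a softmax over negative pairwise Euclidean distances, with self-selection excluded, and sums the probability mass that falls on points of the same class. The gradient computation is left to an automatic differentiation tool and is not shown.

```python
import numpy as np

def nca_objective(embeddings, labels):
    """Expected number of correctly classified points and the neighbour probabilities."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))              # pairwise Euclidean distances
    logits = -d
    np.fill_diagonal(logits, -np.inf)                  # a point never selects itself
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)               # each row sums to one
    same_class = labels[:, None] == labels[None, :]
    return float((p * same_class).sum()), p

embeddings = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good, p = nca_objective(embeddings, np.array([0, 0, 1, 1]))   # classes form tight clusters
bad, _ = nca_objective(embeddings, np.array([0, 1, 0, 1]))    # classes are mixed up
print(good, bad)
```

For the clustered labelling the value is close to the number of points, while for the mixed labelling it is close to zero.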

It should be noted that NCA is capable of adjusting the number of neighbours
for each point: equation (9) shows that the contribution of a neighbour goes to
zero exponentially fast with its distance. Thus, a simple scaling of the whole
coordinate system of the embedding space can be perceived as a soft selection
of the number of neighbours to consider for a class prediction.

4.1 Classifying Sequences 

We first train an RNN on our data set with the NCA objective function.
Afterwards, all training sequences are propagated through the network and the
pooling operator to obtain embeddings $\mathcal{E} = \{e_i\}$ for each of them. We then build
a nearest neighbour classifier for which we use all embeddings of the training
set. A new sequence $(x_1, x_2, \dots, x_T)$ is classified by first forward propagating
it through the RNN and obtaining an embedding. We then find the $k$-nearest
neighbours and obtain the class by a majority vote.
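
The final classification step can be sketched as follows. This is a minimal illustration assuming NumPy, a brute-force neighbour search and made-up two-dimensional embeddings; the function name `knn_classify` is hypothetical, and the embedding of the query sequence is assumed to have been computed by the trained network already.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_embeddings, train_labels, k=1):
    """Majority vote among the k nearest training embeddings (brute-force search)."""
    distances = np.linalg.norm(train_embeddings - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_embeddings = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                             [10.0, 0.0], [10.0, 1.0], [9.0, 0.0]])
train_labels = ["bell", "bell", "bell", "funnel", "funnel", "funnel"]
predicted = knn_classify(np.array([9.0, 0.5]), train_embeddings, train_labels, k=3)
print(predicted)  # funnel
```

A spatial index such as a k-d tree could replace the brute-force search when the training set is large.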

The complexity of classifying a new sequence given a trained RNN is thus
$O(T) + C(f)$, where $C(f)$ is the complexity of a nearest neighbour lookup working in
an $f$-dimensional space. In contrast, the state of the art method for time series
classification, dynamic time warping, has a complexity of $O(T^2 N)$ where $N$ is
the number of sequences in the training data set.
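
For comparison, the quadratic cost of dynamic time warping per pair of sequences is visible in its textbook dynamic-programming form. The following is a minimal sketch in plain Python with NumPy, using the absolute difference as local cost and no warping window; it is a generic formulation and not the implementation behind the reference results.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two one-dimensional sequences."""
    T, U = len(x), len(y)
    cost = np.full((T + 1, U + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T + 1):          # two nested loops: O(T * U) per pair
        for j in range(1, U + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[T, U])

pulse = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(dtw_distance(pulse, shifted))  # 0.0
```

Classifying one query with nearest neighbour search requires this computation against each of the $N$ training sequences, which gives the $O(T^2 N)$ total stated above.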

5. Experiments 

To show that our algorithm works as a classifier we present results on several 
data sets from the UCR Time Series archive (Keogh 2006). By analyzing the 
Cylinder Bell Funnel data set in more detail, we show that the extracted fea- 
tures are actually meaningful. Then we give results on more data sets from the 
UCR archive — although we do not necessarily reach top performance in terms 
of classification error on them, we can see that our algorithm does something 
reasonable. Outstanding results on all of the UCR data sets are achieved by 
Dynamic Time Warping (DTW) in combination with nearest neighbour classi- 
fication (for an overview see (Xi et al. 2006)). 

We then proceed to a real-world data set, namely TIDIGITS, to show that our
method is useful for practical applications. TIDIGITS is similar to the widely
used data set MNIST as it contains the digits 0 to 9, yet spoken instead of
written. For TIDIGITS, the performance of a discriminative model based on
LSTM-RNNs has been reported in (Graves and Schmidhuber 2005).

We optimized the parameters with resilient backprop (RPROP) (Riedmiller 
1994). In our experiments, ordinary stochastic gradient descent with momen- 
tum was never able to find a good parameter set. RPROP on the other hand 
found good minima after initial periods where the training error did not improve 
significantly. It also proved very robust towards hyperparameter selection. In
all experiments, a maximum step size of 1.0, a minimum step size of $10^{-6}$, a
growth factor of 1.1 and a shrinking factor of 0.3 were used. All input data was
normalized to have zero mean and unit variance. The dimensionality of the
embeddings was 2 if not mentioned otherwise. As a pooling operator, we chose
the average: $p(o_1, \dots, o_T) = \frac{1}{T} \sum_i o_i$.
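
To show how the four RPROP hyperparameters enter the update, here is a minimal sketch assuming NumPy. The specific variant (an iRprop- style rule that skips the update after a sign change), the initial step size of 0.1 and the quadratic toy objective are assumptions made for illustration; the text above only fixes the step size bounds and the growth and shrinking factors.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, grow=1.1, shrink=0.3,
                 max_step=1.0, min_step=1e-6):
    """One sign-based update with per-parameter step sizes."""
    agreement = grad * prev_grad
    # Same sign as last time: grow the step; opposite sign: shrink it.
    step = np.where(agreement > 0, np.minimum(step * grow, max_step), step)
    step = np.where(agreement < 0, np.maximum(step * shrink, min_step), step)
    grad = np.where(agreement < 0, 0.0, grad)   # no move directly after a sign change
    w = w - np.sign(grad) * step
    return w, grad, step

# Toy problem: minimise sum(w ** 2), whose gradient is 2 * w.
w = np.array([3.0, -2.0])
prev_grad = np.zeros_like(w)
step = np.full_like(w, 0.1)                     # initial step size (assumption)
for _ in range(200):
    grad = 2.0 * w
    w, prev_grad, step = rprop_update(w, grad, prev_grad, step)
print(w)
```

Only the sign of each partial derivative is used, so the update is insensitive to the scale of the gradient.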

5.1 Synthetic Data: UCR Time Series 

As noted in (Goldberger et al. 2004) we confirmed that NCA almost never 
overfits in our experiments. Thus, all experiments were conducted by training 
several runs until convergence on the training set. We then report the testing 
error of the set of parameters that performed best on the training set. 

The cylinder bell funnel data set is a synthetic data set that includes all the 
problems with sequences outlined in section 2: translation, scaling and stochas- 
ticity. It consists of equal length, one dimensional time series of three classes. At 
a random position a motif (either cylinder, bell or funnel) occurs for a random 
period of time. 

In the case of a cylinder, the value of the time series increases suddenly and 
stays on its level for a while after which it decreases suddenly to the previous 
level. 

A bell is characterized by a slow, constant increase over several time steps and
a sudden drop to the base level again. Funnels are quite the opposite of a bell:
an abrupt increase and a slow decrease to the base level. All items in the
data set are also subject to Gaussian noise. For the exact formulas, see (Geurts
2002).

Although the training data is very limited, consisting of only ten sequences per 
class, we can see that the classes are separated in a meaningful way: The vertical 
axis corresponds to the length of the cylinder, bell or funnel. The horizontal axis 
represents the transition from bell over cylinder to funnel. This is illustrated in
figure 3.

We now give the classification results and specific parameters for several more 
data sets. 

Figure 3: Embeddings found for test sequences of the CBF data set. The green
cluster represents the bell class, the red cluster the cylinder class and the blue 
cluster the funnel class. Individual sequences have been pointed out to show 
that the vertical axis roughly corresponds to the length of the object while the 
horizontal is indicative of the transition from bell over cylinder to funnel time 
series. 



Data set       # classes   # LSTM cells   Epochs   Training error   Testing error   1KNN
CBF            3           12             97       0.972            0.971           0.957
Lighting 2     2           10             100      0.964            0.588           0.721
Lighting 7     7           10             55       0.534            0.458           0.493
Two Patterns   4           5              102      0.676            0.656           0.694


The training and test errors stated are the average probabilities that a point is 
correctly classified by the stochastic classifier used in the formulation of NCA. 
We also report the error for 1-nearest neighbour classification. 

5.2 Real World Data: TIDIGITS 

TIDIGITS is a data set consisting of spoken digits by adult and child speakers. 
We restricted ourselves to the adult speakers. The audio was preprocessed 
with mel-frequency cepstrum coefficient analysis. The setup parameters were 
12 cepstral coefficients, 1 energy coefficient, and 13 first derivatives, giving 26 
coefficients in total. The frame size was 15 ms and the input window was 25 ms.

Figure 4: Two dimensional embeddings for the TIDIGITS data set. Left: the
whole data set with all ten classes. Right: Only the classes corresponding to 
the digits "1", "2" and "4". 

We trained our network on the full data set and on a subset containing only 
the digits "1", "2" and "4". In both cases, we trained networks with 40 LSTM
units for 250 epochs. 



Data set     Training error   Testing error   1KNN
3 digits     0.979            0.981           0.984
All digits   0.601            0.522           0.584


These results are not state of the art: discriminative models based on LSTM-RNNs
are in fact able to reach correct classification rates of more than 99% (Graves
and Schmidhuber 2005). Nevertheless, as can be seen in Figure 4, RNNs in
conjunction with NCA are indeed able to find low level representations of real
world sequential data.

6. Conclusion 

We presented a solution to an important problem — by combining two well es- 
tablished methods we introduced a method to embed sequential data into a 
semantically meaningful metric feature space. Despite not achieving state of 
the art performance on widely used benchmark data sets our method has its 
own value: it has significantly lower complexity than DTW, the current state 
of the art. Furthermore, it is parametric and thus much easier to use for big 
or non-stationary data. Additionally it leads to interpretable features naturally 
and can be used out of the box as a visualization method and data exploration 
tool. 

The techniques presented here are usable with any RNN structure. We believe
that applying echo state networks (Jaeger and Haas 2004) or multiplicative
RNNs (Martens, Sutskever, and Hinton 2011) to NCA might yield even better
results. The visualizations and performances in (Salakhutdinov and Hinton
2007) are arguably more impressive. However, it should be noted that a recurrent
counterpart to deep belief networks is an unaddressed problem; our work would
definitely benefit from its solution.